Data analytics is the process of analyzing raw data to generate actionable insights. It comprises the processes, tools, and techniques of data analysis and management, including the collection, organization, and storage of data. Organizations use data analytics to gain a competitive advantage by improving their performance and operational efficiency. Data analytics is performed on a variety of big data sets, such as transactions, click streams, server logs, electronic health records, and insurance claims. Different analytical techniques and algorithms can be applied to these data sets to accomplish different objectives. These different types of analytical techniques are colloquially called:
Irrespective of the type of analytics being performed, the basis of every method or algorithm in data analytics is descriptive/inferential statistics and machine learning. In this analysis report, we will leverage descriptive statistics to generate insights from the data.
Problem Statement:
CarDekho.com is India’s leading car search venture that helps users buy cars that are right for them. Its website and app carry rich automotive content such as expert reviews, detailed specs and prices, comparisons, and videos and pictures of all car brands and models available in India.
We have the sales data of all the cars sold during the time frame of 1983 to 2020. We are going to analyse this data set in order to help them expand their business, gain and retain customers, and stand out from the competition they face.
The data set has 8128 data points with 13 features in it related to :
Later, we will implement learning algorithms and modelling techniques to understand the patterns and achieve high quality, consistent results targeting the following points :
Importing the libraries required for the analysis
library(tidyverse)
library(grid)
library(gridExtra)
library(dplyr)
library(ggplot2)
library(reshape2)
library(DT)
library(RColorBrewer)
library(data.table)
library(knitr)
library(caret)
library(stringr)
library(RANN)
library(vtable)
library(scales)
Importing the data set for analysis
car_sales <- read.csv("CarDekho Sales.csv")
Head of data
# View the head and summary of the data
DT::datatable(head(car_sales, 5), rownames = FALSE)
Dimension and summary of data
# View the dimension and summary of the data
dim(car_sales)
## [1] 8128 13
summary(car_sales)
## name year selling_price km_driven
## Length:8128 Min. :1983 Min. : 29999 Min. : 1
## Class :character 1st Qu.:2011 1st Qu.: 254999 1st Qu.: 35000
## Mode :character Median :2015 Median : 450000 Median : 60000
## Mean :2014 Mean : 638272 Mean : 69820
## 3rd Qu.:2017 3rd Qu.: 675000 3rd Qu.: 98000
## Max. :2020 Max. :10000000 Max. :2360457
##
## fuel seller_type transmission owner
## Length:8128 Length:8128 Length:8128 Length:8128
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## mileage engine max_power torque
## Length:8128 Length:8128 Length:8128 Length:8128
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## seats
## Min. : 2.000
## 1st Qu.: 5.000
## Median : 5.000
## Mean : 5.417
## 3rd Qu.: 5.000
## Max. :14.000
## NA's :221
Observation - There are 8128 observations with 13 features in the data set.
Extract value from the ‘Mileage’ column
The feature ‘mileage’ is stored as text: each value is a number followed by a unit of fuel efficiency (kmpl, km/kg, etc.). We strip the unit and convert the feature from character to numeric so that it can be used in the analysis. The ‘gsub’ function from the ‘base’ package performs this operation.
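As a quick illustration of this cleaning step, the sketch below applies the same character-class pattern to a few made-up strings (the sample values are assumptions, not rows copied from the data set). Blank entries turn into NA, which is one way missing values can appear in the cleaned column.

```r
# Made-up examples of raw 'mileage' strings (illustrative only).
raw_mileage <- c("23.4 kmpl", "17.3 km/kg", "")

# Remove letters, slashes and spaces, leaving only the numeric part.
stripped <- gsub("[a-zA-Z/ ]", "", raw_mileage)

# Empty strings cannot be parsed, so as.numeric() returns NA for them.
clean_mileage <- as.numeric(stripped)
print(clean_mileage)
```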
# Replace the 'kmpl, km/kg' string from data points in the 'mileage' feature. Convert to numeric data type.
car_sales$mileage <- as.numeric(gsub('[a-zA-Z/ ]', '', car_sales$mileage))
Extract value from the ‘Engine’ column
The feature ‘engine’ is stored as text: each value is a number followed by the unit of engine capacity (CC). We strip the unit and convert the feature from character to numeric so that it can be used in the analysis. The ‘gsub’ function from the ‘base’ package performs this operation.
# Replace the 'CC' string from data points in the 'engine' feature. Convert to numeric data type.
car_sales$engine <- as.numeric(gsub('[a-zA-Z/ ]', '', car_sales$engine))
Extract value from the ‘Max Power’ column
The feature ‘max_power’ is stored as text: each value is a number followed by the unit of brake horsepower (bhp). We strip the unit and convert the feature from character to numeric so that it can be used in the analysis. The ‘str_remove’ function from the ‘stringr’ package performs this operation.
car_sales$max_power <- as.numeric(str_remove(car_sales$max_power, " [a-z]+"))
Extract torque value from the ‘Torque’ column
The feature ‘torque’ is stored as text that combines a torque figure in Newton metres (a metric unit, not an imperial one) with the engine speed in revolutions per minute (RPM). We extract the leading torque figure into a new numeric feature for use in the analysis; the RPM part is extracted in the next step. The functions ‘str_sub’ and ‘str_locate’ from the package ‘stringr’ perform this operation. Note that the extraction stops at the first non-digit character, so any decimal part of the torque figure is dropped.
# Take the characters from position 1 up to (not including) the first non-digit character.
car_sales$torque_val <- as.numeric(str_sub(car_sales$torque, rep(1, nrow(car_sales)),
str_locate(car_sales$torque, "\\D+")[,1]-1))
Extract RPM value from the torque column
The second number in the ‘torque’ feature is the engine speed in RPM. We extract it into a new numeric feature, ‘rpm’, by removing every non-digit character from the string and keeping the last four digits of what remains. The functions ‘str_sub’ and ‘str_length’ from ‘stringr’ and ‘gsub’ from ‘base’ perform these operations.
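An alternative way to pull out both numbers is with regular expressions that keep decimal points. The following is only a sketch on made-up strings in two formats assumed to resemble the column; the sample values and the helper name `parse_torque` are hypothetical, and no unit conversion is attempted.

```r
# Hypothetical helper: extract the leading torque figure and the last RPM figure.
parse_torque <- function(x) {
  x <- gsub(",", "", x)  # drop thousands separators such as "2,700"

  # Leading number, optionally with a decimal part (e.g. "22.4").
  has_number <- grepl("^[0-9]", x)
  torque <- ifelse(has_number,
                   sub("^([0-9]+(\\.[0-9]+)?).*$", "\\1", x),
                   NA_character_)

  # Last run of digits that is preceded by a non-digit character.
  has_rpm <- grepl("[^0-9][0-9]+[^0-9]*$", x)
  rpm <- ifelse(has_rpm,
                sub("^.*[^0-9]([0-9]+)[^0-9]*$", "\\1", x),
                NA_character_)

  data.frame(torque_val = as.numeric(torque), rpm = as.numeric(rpm))
}

# Made-up input strings (illustrative only).
samples <- c("190Nm@ 2000rpm", "22.4 kgm at 1750-2750rpm", "")
parsed <- parse_torque(samples)
print(parsed)
```

For an RPM range, this sketch keeps the upper bound, which matches the "last four digits" behaviour of the report's own code.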
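# Keep only the digits of each 'torque' string, then take the last four as the RPM value.
# The digit string is kept as text so that character positions match 'num_length'.
torque_digits <- gsub("\\D+", "", car_sales$torque)
num_length <- str_length(torque_digits)
car_sales$rpm <- as.numeric(str_sub(torque_digits, num_length-3, num_length))
Extract brand from the car name column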
The feature ‘name’ contains the manufacturer, the model, and the variant of the car. We extract the brand into a new feature for use in the analysis. Most brand names are a single word, but a few span two words, so the extraction takes the first word by default and the first two words for those brands.
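The same rule can be sketched in base R on a handful of made-up names. The car names and the list of two-word prefixes below are assumptions for illustration only.

```r
# Made-up car names in a "<brand> <model> <variant>" layout (illustrative only).
car_names <- c("Maruti Swift Dzire VDI",
               "Land Rover Discovery Sport",
               "Ashok Leyland Stile LE")

# First words of brands whose name spans two words (assumed list).
two_word_prefixes <- c("Ashok", "Land")

words      <- strsplit(car_names, " ", fixed = TRUE)
first_word <- vapply(words, function(w) w[1], character(1))
first_two  <- vapply(words, function(w) paste(w[1:2], collapse = " "), character(1))

# Use two words for the listed prefixes, one word otherwise.
brand <- ifelse(first_word %in% two_word_prefixes, first_two, first_word)
print(brand)
```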
# Extract the brand name (first word) into another feature from the 'name' feature.
car_sales$brand <- word(car_sales$name, start = 1, end = 1)
# Extract the brand name (first 2 words) into another feature from the 'name' feature.
car_sales$brand[car_sales$brand == "Ashok"] <- word(car_sales$name[car_sales$brand == "Ashok"], start = 1, end = 2)
car_sales$brand[car_sales$brand == "Land"] <- word(car_sales$name[car_sales$brand == "Land"], start = 1, end = 2)
car_sales$brand[car_sales$brand == "Range"] <- word(car_sales$name[car_sales$brand == "Range"], start = 1, end = 2)
Filter out the values for CNG and LPG
The feature ‘fuel’ contains four types: Diesel, Petrol, CNG, and LPG. We drop the CNG and LPG records because together they account for only about 1.2% of the records in the data set (95 of 8128).
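The share of each fuel type can be checked before filtering. Below is a minimal sketch on a made-up vector; the counts are assumptions chosen for illustration, not the real distribution.

```r
# Toy fuel column standing in for car_sales$fuel (values are made up).
fuel <- c(rep("Diesel", 55), rep("Petrol", 43), "CNG", "LPG")

# Percentage share of each fuel type.
fuel_share <- round(100 * prop.table(table(fuel)), 1)
print(fuel_share)

# Combined share of the two categories that are candidates for removal.
minor_share <- sum(fuel_share[c("CNG", "LPG")])
print(minor_share)
```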
car_sales <- filter(car_sales, !fuel %in% c("LPG", "CNG"))
Identify blank and NA values in the data
We further checked the data set for NA and blank values in all of its features.
colSums(is.na(car_sales))
## name year selling_price km_driven fuel
## 0 0 0 0 0
## seller_type transmission owner mileage engine
## 0 0 0 214 214
## max_power torque seats torque_val rpm
## 208 0 214 214 214
## brand
## 0
Remove unnecessary columns in the dataframe
We removed columns that are no longer needed, either because their informative values have already been extracted or because they are of no further use.
car_sales$name <- NULL
car_sales$torque <- NULL
# Rename the extracted 'torque_val' column (position 12) to 'torque'.
names(car_sales)[names(car_sales) == "torque_val"] <- "torque"
Observation - We found 214 records with NA values in most of the numeric features. These values need to be either removed or imputed (by mean, median, etc.).
Impute missing and NA values in the data using kNN
The data set contained a few records with missing or NA values in some of their features. We used k-nearest-neighbours (kNN) imputation to replace each missing value with the mean of that feature over the 20 nearest records, so that these records can be kept for the analysis. Because ‘knnImpute’ in ‘caret’ also centres and scales the numeric features, the loop after the imputation converts them back to their original units. The functions ‘preProcess’ and ‘predict’ from the ‘caret’ package were used to perform these operations.
preProcValues <- preProcess(car_sales %>%
dplyr::select(year, selling_price, km_driven, fuel, seller_type,
transmission, owner, mileage, engine, max_power, seats, torque, rpm),
method = c("knnImpute"),
k = 20,
knnSummary = mean)
impute_cars_info <- predict(preProcValues, car_sales, na.action = na.pass)
procNames <- data.frame(col = names(preProcValues$mean), mean = preProcValues$mean, sd = preProcValues$std)
for(i in procNames$col){
car_sales[i] <- impute_cars_info[i]*preProcValues$std[i]+preProcValues$mean[i]
}
colSums(is.na(car_sales))
## year selling_price km_driven fuel seller_type
## 0 0 0 0 0
## transmission owner mileage engine max_power
## 0 0 0 0 0
## seats torque rpm brand
## 0 0 0 0
Observation - No NA or missing values were found after the kNN imputation.
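To make the imputation step more concrete, the sketch below reimplements the idea in base R on a toy data frame: standardise the predictor columns, find the k complete rows closest to the incomplete row, and average their values for the missing column. It is a simplified illustration, not caret's implementation; the helper name `knn_impute_one` and the toy values are assumptions.

```r
# Toy data: the last row is missing 'engine' (values are made up).
toy <- data.frame(
  mileage = c(20, 21, 19, 12, 13, 20.5),
  power   = c(70, 75, 68, 150, 160, 72),
  engine  = c(1200, 1250, 1190, 2200, 2400, NA)
)

# Hypothetical helper: impute column `target` of row `i` from its k nearest neighbours.
knn_impute_one <- function(df, i, target, k = 3) {
  predictors <- setdiff(names(df), target)
  # Standardise the predictors so that no single unit dominates the distance.
  z <- scale(df[predictors])
  # Candidate neighbours: rows where the target value is observed.
  donors <- which(!is.na(df[[target]]))
  # Euclidean distance from row i to every donor row.
  dists <- sqrt(rowSums(sweep(z[donors, , drop = FALSE], 2, z[i, ])^2))
  nearest <- donors[order(dists)][seq_len(k)]
  # Imputed value: mean of the target over the k nearest donors.
  mean(df[[target]][nearest])
}

toy$engine[6] <- knn_impute_one(toy, i = 6, target = "engine", k = 3)
print(toy)
```

In this toy case the three nearest rows are the three small-engined cars, so the imputed value is their average.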
Create categorical column to indicate the size of the car
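A new feature, ‘car_type’, is created to indicate the size of the car. It is derived from the number of seats with the following rule set: 5 seats or fewer is ‘Small’, more than 5 and up to 8 is ‘Medium’, more than 8 and up to 14 is ‘Large’, and anything above 14 is ‘Others’.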
car_sales$car_type <- ifelse(car_sales$seats <= 5, "Small", ifelse(car_sales$seats <= 8, "Medium", ifelse(car_sales$seats <= 14, "Large", "Others")))
unique(car_sales$car_type)
## [1] "Small" "Medium" "Large"
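The nested `ifelse` calls can also be written as a single binning step with `cut`. This is a sketch of an equivalent formulation on made-up seat counts, including a non-integer value of the kind that mean-based imputation can produce.

```r
# Example seat counts (made up).
seats <- c(2, 4, 5, 5.3, 7, 8, 9, 14)

car_type <- cut(
  seats,
  breaks = c(-Inf, 5, 8, 14, Inf),
  labels = c("Small", "Medium", "Large", "Others"),
  right  = TRUE   # intervals are closed on the right: (5, 8] is "Medium"
)
print(table(car_type))
```

Unlike `ifelse`, `cut` returns a factor directly, with the levels in size order.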
Create a numeric column to obtain the age of the car
We created another feature named ‘age’ which signifies the age of the car. The feature ‘year’ was utilised to calculate the age of all the cars.
car_sales$age <- as.numeric(format(Sys.Date(), "%Y")) - car_sales$year
ggplot(car_sales) +
geom_boxplot(mapping = aes(age))
Unit Modification (Changing 1000s to 1s) of two features
The values in the features ‘Selling Price’ and ‘KMs driven’ run into the hundreds of thousands. We divided both by 1,000, so that they are expressed in thousands, for easier analysis and better scaling of the visualisations.
car_sales$selling_price <- round( (car_sales$selling_price / 1000), 2)
car_sales$km_driven <- round( (car_sales$km_driven / 1000), 2)
Changing Character data type to Factor data type
The features of character data type are converted to the factor data type so that they are treated as categorical variables in the plots and tests that follow.
car_sales$fuel <- as.factor(car_sales$fuel)
car_sales$seller_type <- as.factor(car_sales$seller_type)
car_sales$transmission <- as.factor(car_sales$transmission)
car_sales$owner <- as.factor(car_sales$owner)
car_sales$brand <- as.factor(car_sales$brand)
car_sales$car_type <- as.factor(car_sales$car_type)
# Alternative: convert every character column in one step.
#car_sales[sapply(car_sales, is.character)] <- lapply(car_sales[sapply(car_sales, is.character)], as.factor)
Descriptive Statistics of the data set
# Descriptive Statistics of the data set.
mul_fun <- function(x) {
c(mean(x), sd(x), median(x), min(x), max(x), max(x)-min(x), quantile(x, 0.25),
quantile(x, 0.5), quantile(x, 0.75))
}
var_names <- c("Mean", "Std Dev", "Median", "Min", "Max", "Range", "Percentile 25", "Percentile 50",
"Percentile 75")
num_cols <- c("selling_price", "km_driven", "mileage", "engine", "max_power", "torque", "rpm", "age")
desc_stats <- as.data.frame(round(sapply(car_sales[num_cols], mul_fun), 2), row.names = var_names)
DT::datatable(desc_stats)
Observation - From the statistics table, we found the below points :
Identify the Outliers in numeric variables
num_cols <- c("selling_price", "km_driven", "mileage", "engine", "max_power", "torque", "rpm", "age")
outlier_val <- sapply(car_sales[num_cols], function(x) boxplot.stats(x)$out)
min_outlier <- data.frame('Count of Outliers' = sapply(outlier_val, length), check.names = FALSE)
DT::datatable(min_outlier)
Observation - The count of outliers for each feature is tabulated above. ‘Selling Price’ has 600 outliers and ‘engine’ has more than 1,000. ‘RPM’ has no outliers in the data set.
These outliers will need to be dealt with before building the regression models.
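`boxplot.stats` flags a value as an outlier when it lies more than 1.5 times the hinge spread (approximately the interquartile range) beyond the lower or upper hinge. The sketch below reproduces that rule by hand on a made-up vector and compares it with the function's own result.

```r
# Made-up sample with one extreme value.
x <- c(10, 12, 11, 13, 12, 11, 10, 14, 13, 95)

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max.
five <- fivenum(x)
hinge_spread <- five[4] - five[2]

# Fences at 1.5 times the hinge spread beyond the hinges.
lower_fence <- five[2] - 1.5 * hinge_spread
upper_fence <- five[4] + 1.5 * hinge_spread

manual_out <- x[x < lower_fence | x > upper_fence]
print(manual_out)

# Same result from the function used in the report.
print(boxplot.stats(x)$out)
```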
Frequency plot of the categorical variables
We have several categorical features that can provide insight into the data set. We built frequency tables for the following features: fuel type, seller type, transmission, owner type, and car type.
# Frequency tables of the categorical features
freqTable_fuel <- dplyr::count(car_sales, car_sales$fuel)
colnames(freqTable_fuel) <- c("Fuel Type", "Frequency")
freqTable_sellerType <- dplyr::count(car_sales, car_sales$seller_type)
colnames(freqTable_sellerType) <- c("Seller Type", "Frequency")
freqTable_transmission <- dplyr::count(car_sales, car_sales$transmission)
colnames(freqTable_transmission) <- c("Transmission", "Frequency")
freqTable_owner <- dplyr::count(car_sales, car_sales$owner)
colnames(freqTable_owner) <- c("Owner Type", "Frequency")
freqTable_carType <- dplyr::count(car_sales, car_sales$car_type)
colnames(freqTable_carType) <- c("Car Type", "Frequency")
bar_fuel <- ggplot(freqTable_fuel)+
geom_bar(
mapping = aes(
x = `Fuel Type`,
y = Frequency
),
stat="identity"
) +
theme(panel.background = element_rect("#a7f4fc")) +
labs(
title = "Plot 1: Fuel Type Distribution",
x = "Fuel Type",
y = "Frequency Count"
) +
theme(
plot.title = element_text(hjust = 0.5, colour = "#7F3D17", face = "bold"),
axis.text.x = element_text(vjust = 1, size = 10)
) +
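# Set labels and limits in one scale; a separate ylim() would replace this scale and drop the comma labels.
scale_y_continuous(labels = comma, limits = c(0, max(freqTable_fuel$Frequency) * 1.4)) +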
geom_text(aes(x = `Fuel Type`, y = Frequency, label = Frequency), vjust = 1.8, color = "WHITE", size = 4)
bar_sellerType <- ggplot(freqTable_sellerType)+
geom_bar(
mapping = aes(
x = `Seller Type`,
y = Frequency
),
stat="identity"
) +
theme(panel.background = element_rect("#a7f4fc")) +
labs(
title = "Plot 2: Seller Type Distribution",
x = "Seller Type",
y = "Frequency Count"
) +
theme(
plot.title = element_text(hjust = 0.5, colour = "#7F3D17", face = "bold"),
axis.text.x = element_text(vjust = 1, size = 10)
) +
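# Set labels and limits in one scale; a separate ylim() would replace this scale and drop the comma labels.
scale_y_continuous(labels = comma, limits = c(0, max(freqTable_sellerType$Frequency) * 1.4)) +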
geom_text(aes(x = `Seller Type`, y = Frequency, label = Frequency), vjust = 1.8, color = "WHITE", size = 4)
bar_transmission <- ggplot(freqTable_transmission)+
geom_bar(
mapping = aes(
x = Transmission,
y = Frequency
),
stat="identity"
) +
theme(panel.background = element_rect("#a7f4fc")) +
labs(
title = "Plot 3: Transmission Type Distribution",
x = "Transmission Type",
y = "Frequency Count"
) +
theme(
plot.title = element_text(hjust = 0.5, colour = "#7F3D17", face = "bold"),
axis.text.x = element_text(vjust = 1, size = 10)
) +
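# Set labels and limits in one scale; a separate ylim() would replace this scale and drop the comma labels.
scale_y_continuous(labels = comma, limits = c(0, max(freqTable_transmission$Frequency) * 1.4)) +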
geom_text(aes(x = Transmission, y = Frequency, label = Frequency), vjust = 1.8, color = "WHITE", size = 4)
bar_owner <- ggplot(freqTable_owner)+
geom_bar(
mapping = aes(
x = `Owner Type`,
y = Frequency
),
stat="identity"
) +
theme(panel.background = element_rect("#a7f4fc")) +
labs(
title = "Plot 4: Owner Type Distribution",
x = "Owner Type",
y = "Frequency Count"
) +
theme(
plot.title = element_text(hjust = 0.5, colour = "#7F3D17", face = "bold"),
axis.text.x = element_text(vjust = 1, size = 10, angle = 0)
) +
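# Set labels and limits in one scale; a separate ylim() would replace this scale and drop the comma labels.
scale_y_continuous(labels = comma, limits = c(0, max(freqTable_owner$Frequency) * 1.4)) +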
geom_text(aes(x = `Owner Type`, y = Frequency, label = Frequency), vjust = 1.8, color = "WHITE", size = 4)
bar_carType <- ggplot(freqTable_carType)+
geom_bar(
mapping = aes(
x = `Car Type`,
y = Frequency
),
stat="identity"
) +
theme(panel.background = element_rect("#a7f4fc")) +
labs(
title = "Plot 5: Car Type Distribution",
x = "Car Type",
y = "Frequency Count"
) +
theme(
plot.title = element_text(hjust = 0.5, colour = "#7F3D17", face = "bold"),
axis.text.x = element_text(vjust = 1, size = 10)
) +
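# Set labels and limits in one scale; a separate ylim() would replace this scale and drop the comma labels.
scale_y_continuous(labels = comma, limits = c(0, max(freqTable_carType$Frequency) * 1.4)) +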
geom_text(aes(x = `Car Type`, y = Frequency, label = Frequency), vjust = 1.8, color = "WHITE", size = 4)
grid.arrange(bar_fuel, bar_sellerType, bar_transmission, bar_owner, bar_carType)
Observation - From the above graphs, we can figure out the following points :
Histogram and Box plot of numeric variables
In the following section, we’re plotting the histograms and boxplots to understand the uni-variate distribution of the numeric variables.
# Function to plot graph
univariate_plot <- function(x){
grid.arrange(ggplot(data = car_sales) + geom_histogram(mapping = aes(x)),
ggplot(data = car_sales) + geom_boxplot(mapping = aes(x)))
}
#Selling Price
univariate_plot(car_sales$selling_price)
Observation - Most of the data is concentrated in the range of 0 to 1300, and values above 1325 are outliers.
#Kilometer Driven
univariate_plot(car_sales$km_driven)
Observation - Mahindra XUV500 W6 2WD and Hyundai i20 Asta 1.2 are outliers in the data set, as they have been driven over 150,000 km.
#Car mileage
univariate_plot(car_sales$mileage)
Observation - There are 2 outliers at both extremes. Otherwise, the data appears to be normally distributed.
#Engine capacity (CC)
univariate_plot(car_sales$engine)
Observation - A fair share of the cars have engines of about 1200 CC. Records above 2200 CC are outliers in the data set.
# Torque
univariate_plot(car_sales$torque)
Observation - The feature ‘torque’ appears to be normally distributed, with some outliers present.
# RPM
univariate_plot(car_sales$rpm)
Observation - There are no outliers for the feature ‘rpm’, and no clear distribution can be discerned from the data.
Check for the normality of numeric variables
A Q-Q plot is used to visualise the normality of numeric data. If the ‘Sample’ quantiles fall along the same line as the ‘Theoretical’ quantiles, the feature can be considered normally distributed.
# Check Normality using Shapiro-Wilks Test
#shapiro.test(car_sales$selling_price)
# Function to plot graph
qq_plot <- function(numeric_feature, mainTitle) {
qqnorm(numeric_feature, pch = 5, frame = TRUE, main = mainTitle)
qqline(numeric_feature, col = "#52ffbf", lwd = 2)
}
# Changing Plot Matrix Size to 3x3.
par(mfrow = c(3,3))
# Check Normality using Q-Q Plot of 'Selling Price' Feature.
qq_plot(car_sales$selling_price, "Selling Price")
# Check Normality using Q-Q Plot of 'KMs Driven' Feature.
qq_plot(car_sales$km_driven, "KMs Driven")
# Check Normality using Q-Q Plot of 'Mileage' Feature.
qq_plot(car_sales$mileage, "Mileage")
# Check Normality using Q-Q Plot of 'Engine' Feature.
qq_plot(car_sales$engine, "Engine")
# Check Normality using Q-Q Plot of 'Max Power' Feature.
qq_plot(car_sales$max_power, "Max Power")
# Check Normality using Q-Q Plot of 'Torque' Feature.
qq_plot(car_sales$torque, "Torque")
# Check Normality using Q-Q Plot of 'RPM' Feature.
qq_plot(car_sales$rpm, "RPM")
# Resetting Plot Matrix Size to 1x1.
par(mfrow = c(1,1))
Observation - The features ‘mileage’ and ‘KMs driven’ can be considered normally distributed; the other features will need to be transformed before they can be treated as such.
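The Shapiro-Wilk test that is commented out above cannot be run on the full data set, because `shapiro.test` only accepts between 3 and 5000 observations. One workaround is to test a random subsample, as in this sketch. The synthetic right-skewed vector is an assumed stand-in for a numeric feature, not the real data.

```r
set.seed(42)                      # fixed seed so the subsample is reproducible
x <- rexp(8000, rate = 1 / 70)    # synthetic right-skewed stand-in for a numeric feature

# shapiro.test() rejects samples larger than 5000, so test a random subsample.
sub_x <- sample(x, size = 5000)
sw <- shapiro.test(sub_x)
print(sw)
```

With samples this large the test tends to reject normality even for small departures, so the Q-Q plots remain a useful complement to the p-value.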
Correlation matrix of numeric variables
Understanding the correlation between the numeric features of the data set.
#install.packages("ggcorrplot")
library(ggcorrplot)
data_corr <- car_sales[, c("selling_price", "km_driven", "mileage", "engine",
"max_power", "seats", "torque", "rpm", "age")]
corr <- round(cor(data_corr), 2)
ggcorrplot(corr, hc.order = TRUE, outline.col = "white",
ggtheme = ggplot2::theme_gray, colors = c("#6D9EC1", "white", "#E46726"))
Observation - ‘Max Power’ seems to be highly correlated with ‘Torque’ and ‘Selling Price’.
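Besides reading the heat map, the strongest pairs can be listed directly from the correlation matrix. The sketch below uses the built-in `mtcars` data as a stand-in for the numeric columns of `car_sales` (an assumption for illustration); the same steps would apply to `corr` as computed above.

```r
# Built-in mtcars data used as a stand-in for the numeric columns of car_sales.
corr <- round(cor(mtcars[, c("mpg", "disp", "hp", "wt")]), 2)

# Keep each pair once (upper triangle) and sort by absolute correlation.
idx <- which(upper.tri(corr), arr.ind = TRUE)
pairs_tbl <- data.frame(
  feature_1 = rownames(corr)[idx[, "row"]],
  feature_2 = colnames(corr)[idx[, "col"]],
  r         = corr[idx]
)
pairs_tbl <- pairs_tbl[order(-abs(pairs_tbl$r)), ]
print(pairs_tbl)
```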
Pairplot of all the numeric variables
Using pairplots, we can figure out the correlations (Positive, Negative, No Correlation) between the attributes (features) of the data set.
#install.packages("GGally")
library(GGally)
ggpairs(car_sales, columns = c(2, 3, 8:13), aes(color = transmission, alpha = 0.5))
Observation - There are positive correlations between ‘Max Power’ and ‘Engine’. No correlations can be found between ‘Kms Driven’ and other attributes.
Yearly trend of price
Scatterplot of kilometres driven against selling price, year by year. The plot is rendered as an animated GIF instead of a static plot.
#install.packages("gganimate")
#install.packages("gifski")
#install.packages("av")
library(ggplot2)
library(gganimate)
# Make a ggplot, but add frame=year: one image per year
gif1 <- ggplot(car_sales, aes(km_driven, selling_price, color = transmission)) +
geom_point() +
scale_x_log10() +
scale_y_log10() +
theme_bw() +
# gganimate specific bits:
labs(title = 'Year: {frame_time}', x = 'Kilometers driven', y = 'Price of car') +
transition_time(as.integer(year)) +
ease_aes('linear')
# Save as GIF:
animate(gif1, nframes = 100, fps = 5, end_pause = 20, renderer=gifski_renderer("test.gif"))
Observation - 1. Very few cars were sold from 1983 to 2000, and most of them were manual cars. 2. After 2005, the demand for automatic cars increased, peaking in 2015 and overtaking manual cars.
Scatter plot between ‘Selling Price’ and ‘Mileage’ distributed by transmission type
# Make a ggplot (Scatter plot) of variables 'Mileage' and 'Selling Price'.
ggplot(data = car_sales, aes(x = mileage, y = selling_price, color = transmission)) +
geom_point() +
labs(title = "Scatter Plot of Cars' Mileage vs Selling Price", x = "Mileage", y = 'Selling Price') +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, size = 10)) +
geom_smooth(method = "lm")
summary(lm(car_sales$selling_price ~ car_sales$mileage))
##
## Call:
## lm(formula = car_sales$selling_price ~ car_sales$mileage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1002.5 -377.9 -171.2 71.1 9901.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1107.550 44.651 24.80 <2e-16 ***
## car_sales$mileage -24.029 2.261 -10.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 804.3 on 8031 degrees of freedom
## Multiple R-squared: 0.01387, Adjusted R-squared: 0.01374
## F-statistic: 112.9 on 1 and 8031 DF, p-value: < 2.2e-16
Scatter plot between ‘Engine’ and ‘Selling Price’ distributed by transmission type
# Make a ggplot (Scatter plot) of variables 'Engine' and 'Selling Price'.
ggplot(data = car_sales, aes(x = engine, y = selling_price, color = transmission)) +
geom_point() +
labs(title = "Scatter Plot of Cars' Engine vs Selling Price", x = "Engine", y = 'Selling Price') +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, size = 10)) +
geom_smooth(method = "lm")
summary(lm(car_sales$selling_price ~ car_sales$engine))
##
## Call:
## lm(formula = car_sales$selling_price ~ car_sales$engine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1650.8 -277.3 -51.1 125.3 8981.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -433.08399 24.77613 -17.48 <2e-16 ***
## car_sales$engine 0.73746 0.01606 45.91 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 720.8 on 8031 degrees of freedom
## Multiple R-squared: 0.2079, Adjusted R-squared: 0.2078
## F-statistic: 2108 on 1 and 8031 DF, p-value: < 2.2e-16
Scatter plot between ‘Max Power’ and ‘Selling Price’ distributed by transmission type
# Make a ggplot (Scatter plot) of variables 'Max Power' and 'Selling Price'.
# Max power goes on the x axis and selling price on the y axis, matching the axis labels.
ggplot(data = car_sales, aes(x = max_power, y = selling_price, color = transmission)) +
geom_point() +
labs(title = "Scatter Plot of Cars' Max-Power vs Selling Price", x = "Max-Power", y = 'Selling Price') +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, size = 10)) +
geom_smooth(method = "lm")
summary(lm(car_sales$selling_price ~ car_sales$max_power))
##
## Call:
## lm(formula = car_sales$selling_price ~ car_sales$max_power)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2733.0 -196.6 3.5 184.0 4238.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -916.502 16.485 -55.6 <2e-16 ***
## car_sales$max_power 17.052 0.168 101.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 536 on 8031 degrees of freedom
## Multiple R-squared: 0.562, Adjusted R-squared: 0.5619
## F-statistic: 1.03e+04 on 1 and 8031 DF, p-value: < 2.2e-16
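Reading the output above: the slope of 17.052 means that, in this simple model, each additional bhp is associated with a selling price about 17.05 thousand higher, and the R-squared of 0.562 means that max power alone accounts for roughly 56% of the variance in price. The sketch below shows how such figures can be pulled out of a fitted model programmatically. It uses the built-in `mtcars` data as a stand-in (an assumption for illustration) and passes the data frame through the `data` argument so that `predict` works on new values.

```r
# Stand-in model: fuel economy explained by horsepower in the built-in mtcars data.
fit <- lm(mpg ~ hp, data = mtcars)

slope     <- coef(fit)[["hp"]]         # change in mpg per extra unit of hp
r_squared <- summary(fit)$r.squared    # share of variance explained

# Predicted mpg for a hypothetical car with 110 hp.
pred <- predict(fit, newdata = data.frame(hp = 110))

print(c(slope = slope, r_squared = r_squared, prediction = unname(pred)))
```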
Scatter plot between ‘Selling Price’ and ‘Owner Type’ distributed by transmission type
# Make a ggplot (Scatter plot) of variables 'Owner Type' and 'Selling Price'.
ggplot(data = car_sales, aes(x = owner, y = selling_price, color = transmission)) +
geom_point() +
labs(title = "Scatter Plot of Cars' Owner Type vs Selling Price", x = "Owner Type", y = 'Selling Price') +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, size = 10)) +
geom_smooth(method = "lm")
summary(lm(car_sales$selling_price ~ car_sales$owner))
##
## Call:
## lm(formula = car_sales$selling_price ~ car_sales$owner)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3053.8 -337.6 -162.6 34.3 9212.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 787.60 10.75 73.246 <2e-16 ***
## car_sales$ownerFourth & Above Owner -560.29 60.65 -9.238 <2e-16 ***
## car_sales$ownerSecond Owner -391.70 20.19 -19.397 <2e-16 ***
## car_sales$ownerTest Drive Car 3616.20 348.20 10.385 <2e-16 ***
## car_sales$ownerThird Owner -501.86 34.97 -14.352 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 778.2 on 8028 degrees of freedom
## Multiple R-squared: 0.07707, Adjusted R-squared: 0.07661
## F-statistic: 167.6 on 4 and 8028 DF, p-value: < 2.2e-16
Scatter plot between ‘Brand’ and ‘Selling Price’ distributed by transmission type
# Make a ggplot (Scatter plot) of variables 'Brand' and 'Selling Price'.
ggplot(data = car_sales, aes(x = brand, y = selling_price, color = transmission)) +
geom_point() +
labs(title = "Scatter Plot of Cars' Brands vs Selling Price", x = "Brands", y = 'Selling Price') +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, size = 10)) +
geom_smooth(method = "lm")
#summary(lm(car_sales$brand ~ car_sales$selling_price))
Hypothesis testing is used to validate an assumption about a population parameter, so that the conclusion can be generalized. Depending on the sample size, the distribution, and the sample statistics known, different types of hypothesis tests can be employed. The general steps involved in hypothesis testing are:
1. Formulate the null and alternative hypotheses
2. Plan the test to be performed and decide the critical value
3. Perform the test and obtain the test statistic
4. Reject the null hypothesis, or conclude that there is insufficient evidence to reject it
One Sample t-Test of the kilometers driven by ‘Individual’ Seller-type
NULL HYPOTHESIS, H0 : True Mean of kilometers driven by ‘Individual’ seller-type is equal to the overall average.
ALTERNATE HYPOTHESIS, H1 : True Mean of kilometers driven by ‘Individual’ seller-type is greater than the overall average.
ALTERNATIVE : Greater
# Average of 'Km_driven' across the data set.
mean_km_driven <- mean(car_sales$km_driven)
# One-Sample t-test for kilometers driven by 'Individual' Seller type with its mean.
ttest <- t.test(car_sales$km_driven[car_sales$seller_type == "Individual"], mu = mean_km_driven, alternative = "greater", conf.level = .95)
ttest
##
## One Sample t-test
##
## data: car_sales$km_driven[car_sales$seller_type == "Individual"]
## t = 7.2395, df = 6672, p-value = 2.503e-13
## alternative hypothesis: true mean is greater than 69.73881
## 95 percent confidence interval:
## 73.79173 Inf
## sample estimates:
## mean of x
## 74.98352
format(ttest$p.value, scientific = FALSE)
## [1] "0.0000000000002503328"
DEGREE OF FREEDOM : 6672
P-VALUE : 2.5033283e-13
Observation - Since the p-value of our one-sample t-test is 2.5033283e-13, which is less than our alpha = 0.05, we reject the null hypothesis. This means that we have sufficient evidence to say that the true mean of the kilometers driven by the ‘Individual’ seller-type is greater than the mean of kilometers driven across all seller-types in the data set.
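For reference, the statistic reported by a one-sample t-test is t = (sample mean - mu) / (s / sqrt(n)), with n - 1 degrees of freedom. The sketch below computes it by hand on synthetic data and compares it with `t.test`; the data and the hypothesised mean are made up for illustration.

```r
set.seed(1)
# Synthetic stand-in for the kilometres driven by one seller type (values are made up).
x  <- rnorm(200, mean = 75, sd = 40)
mu <- 70                                   # hypothesised mean under H0

n      <- length(x)
t_stat <- (mean(x) - mu) / (sd(x) / sqrt(n))          # test statistic
p_val  <- pt(t_stat, df = n - 1, lower.tail = FALSE)  # upper-tail p-value for H1: mean > mu

res <- t.test(x, mu = mu, alternative = "greater")
print(c(manual_t = t_stat, t_test = unname(res$statistic)))
print(c(manual_p = p_val, t_test = res$p.value))
```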
One Sample t-Test of the selling price of cars sold by ‘Dealer’ Seller-type
NULL HYPOTHESIS, H0 : True Mean of selling price of cars sold by ‘Dealer’ seller-type is equal to the overall average.
ALTERNATE HYPOTHESIS, H1 : True Mean of selling price of cars sold by ‘Dealer’ seller-type is greater than the overall average.
ALTERNATIVE : Greater
# Average of 'selling_price' across the data set.
mean_selling_price <- mean(car_sales$selling_price)
# One-Sample t-test for selling price of cars sold by 'Dealer' Seller type with its mean.
ttest <- t.test(car_sales$selling_price[car_sales$seller_type == "Dealer"], mu = mean_selling_price, alternative = "greater", conf.level = .95)
ttest
##
## One Sample t-test
##
## data: car_sales$selling_price[car_sales$seller_type == "Dealer"]
## t = 17.151, df = 1123, p-value < 2.2e-16
## alternative hypothesis: true mean is greater than 642.7362
## 95 percent confidence interval:
## 1368.685 Inf
## sample estimates:
## mean of x
## 1445.764
format(ttest$p.value, scientific = FALSE)
## [1] "0.00000000000000000000000000000000000000000000000000000000004835655"
DEGREE OF FREEDOM : 1123
P-VALUE : 4.8356552e-59
Observation - Since the p-value of our one-sample t-test is 4.8356552e-59, which is less than our alpha = 0.05, we reject the null hypothesis. This means that we have sufficient evidence to say that the true mean of the selling price of cars sold by the ‘Dealer’ seller-type is greater than the mean selling price across all seller-types in the data set.
The feature ‘Owner’ contains values like ‘First Owner, Second Owner, Third Owner, Fourth Owner, Test Drive Cars’. We removed the ‘Test Drive’ cars from the data set to make it more consumer-centric, and also because they account for very few records compared with the other owner types.
# Remove cars belonging to 'Test Drive' owners.
car_sales <- filter(car_sales, !owner %in% c("Test Drive Car"))
One Sample t-Test of the kilometers driven by ‘First Owners’ of the cars
NULL HYPOTHESIS, H0 : True Mean of kilometers driven by ‘First Owner’ owner-type is equal to the overall average.
ALTERNATE HYPOTHESIS, H1 : True Mean of kilometers driven by ‘First Owner’ owner-type is less than the overall average.
ALTERNATIVE : Less
# Average of 'Km_driven' across the data set.
mean_km_driven <- mean(car_sales$km_driven)
# One-Sample t-test for kilometers driven by 'First Owners' of the cars with its mean.
ttest <- t.test(car_sales$km_driven[car_sales$owner == "First Owner"], mu = mean_km_driven, alternative = "less", conf.level = .95)
ttest
##
## One Sample t-test
##
## data: car_sales$km_driven[car_sales$owner == "First Owner"]
## t = -18.941, df = 5237, p-value < 2.2e-16
## alternative hypothesis: true mean is less than 69.77313
## 95 percent confidence interval:
## -Inf 58.43108
## sample estimates:
## mean of x
## 57.35225
format(ttest$p.value, scientific = FALSE)
## [1] "0.000000000000000000000000000000000000000000000000000000000000000000000000000009682691"
DEGREES OF FREEDOM : 5237
P-VALUE : 9.6826906 × 10^{-78}
Observation - Since the p-value of our One-Sample t-Test is 9.6826906 × 10^{-78}, which is less than our alpha = 0.05, we reject the Null Hypothesis of the test. This means that we have sufficient evidence to say that the true mean of kilometers driven by the ‘First Owner’ owner-type is less than the average kilometers driven across all owner-types in the data set.
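To show where the t statistic and the degrees of freedom of a one-sample test come from, the sketch below recomputes them by hand and compares them with `t.test()`. It uses synthetic data and a made-up reference value `mu0`; both are assumptions for illustration only and are not the CarDekho figures.

```r
# Synthetic stand-in data (NOT the CarDekho data): kilometers driven, in thousands.
set.seed(42)
km_first_owner <- rnorm(200, mean = 57, sd = 45)
mu0 <- 70  # hypothetical reference value playing the role of the overall average

n     <- length(km_first_owner)
x_bar <- mean(km_first_owner)
s     <- sd(km_first_owner)

# Textbook formula: t = (sample mean - mu0) / (s / sqrt(n)), with n - 1 degrees of freedom.
t_manual  <- (x_bar - mu0) / (s / sqrt(n))
df_manual <- n - 1

# Lower-tail p-value, matching alternative = "less".
p_manual <- pt(t_manual, df = df_manual)

# The same quantities as reported by t.test().
tt <- t.test(km_first_owner, mu = mu0, alternative = "less", conf.level = 0.95)

print(c(manual = t_manual, t.test = unname(tt$statistic)))
print(c(manual = df_manual, t.test = unname(tt$parameter)))
print(c(manual = p_manual, t.test = tt$p.value))
```

A sample of n observations gives n - 1 degrees of freedom, which is why the ‘First Owner’ test above reports df = 5237.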
One Sample t-Test of the selling price of cars owned by ‘First Owners’
NULL HYPOTHESIS, H0 : True Mean of selling price of cars sold by ‘First Owner’ owner-type is equal to the overall average.
ALTERNATE HYPOTHESIS, H1 : True Mean of selling price of cars sold by ‘First Owner’ owner-type is greater than the overall average.
ALTERNATIVE : Greater
# Average of 'selling_price' across the data set.
mean_selling_price <- mean(car_sales$selling_price)
# One-Sample t-test comparing the selling price of cars owned by 'First Owners' against the overall mean.
ttest <- t.test(car_sales$selling_price[car_sales$owner == "First Owner"], mu = mean_selling_price, alternative = "greater", conf.level = .95)
ttest
##
## One Sample t-test
##
## data: car_sales$selling_price[car_sales$owner == "First Owner"]
## t = 11.39, df = 5237, p-value < 2.2e-16
## alternative hypothesis: true mean is greater than 640.3937
## 95 percent confidence interval:
## 766.3377 Inf
## sample estimates:
## mean of x
## 787.5987
print(paste("P-value :", format(ttest$p.value, scientific = FALSE)))
## [1] "P-value : 0.000000000000000000000000000005204048"
DEGREES OF FREEDOM : 5237
P-VALUE : 5.2040483 × 10^{-30}
Observation - Since the p-value of our One-Sample t-Test is 5.2040483 × 10^{-30}, which is less than our alpha = 0.05, we reject the Null Hypothesis of the test. This means that we have sufficient evidence to say that the true mean selling price of cars sold by the ‘First Owner’ owner-type is greater than the average selling price of cars sold by all owner-types in the data set.
Two Sample t-Test to test equality of the kilometers driven by ‘Small & Medium’ Car-types
NULL HYPOTHESIS, H0 : Mean kilometers driven for small cars (car_type = small) is equal to the mean kilometers driven for medium cars. mean(small car_type) = mean(medium car_type)
ALTERNATE HYPOTHESIS, H1 : Mean kilometers driven for small cars (car_type = small) is not equal to the mean kilometers driven for medium cars. mean(small car_type) != mean(medium car_type)
ALTERNATIVE : Two.Sided
# Two-Sample t-test to test equality of the kilometers driven by 'Small & Medium' Car-types.
ttest <- t.test(car_sales$km_driven[car_sales$car_type == "Small"], car_sales$km_driven[car_sales$car_type == "Medium"], alternative = "two.sided", conf.level = .95)
ttest
##
## Welch Two Sample t-test
##
## data: car_sales$km_driven[car_sales$car_type == "Small"] and car_sales$km_driven[car_sales$car_type == "Medium"]
## t = -16.422, df = 1998.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -33.83369 -26.61471
## sample estimates:
## mean of x mean of y
## 63.33297 93.55717
format(ttest$p.value, scientific = FALSE)
## [1] "0.000000000000000000000000000000000000000000000000000000006046862"
DEGREES OF FREEDOM : 1998.3096793
P-VALUE : 6.0468621 × 10^{-57}
Observation - Since the p-value of our Two-Sample t-Test is 6.0468621 × 10^{-57}, which is less than our alpha = 0.05, we reject the Null Hypothesis of the test. This means that we have sufficient evidence to say that the mean kilometers driven for small cars (car_type = small) is not equal to the mean kilometers driven for medium cars in the data set.
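The fractional degrees of freedom arise because `t.test()` defaults to `var.equal = FALSE`, that is, a Welch test that does not assume equal variances in the two groups. As a rough sketch, the Welch-Satterthwaite approximation can be reproduced by hand. The two vectors below are synthetic stand-ins (an assumption for illustration), not the ‘Small’ and ‘Medium’ car data.

```r
# Synthetic stand-in groups (NOT the CarDekho data) with unequal sizes and variances.
set.seed(1)
small  <- rnorm(300, mean = 63, sd = 40)
medium <- rnorm(120, mean = 93, sd = 60)

# Squared standard error of each group mean.
v1 <- var(small)  / length(small)
v2 <- var(medium) / length(medium)

# Welch t statistic: difference in means over the combined standard error.
t_manual <- (mean(small) - mean(medium)) / sqrt(v1 + v2)

# Welch-Satterthwaite degrees of freedom (generally not an integer).
df_welch <- (v1 + v2)^2 /
  (v1^2 / (length(small) - 1) + v2^2 / (length(medium) - 1))

# The same quantities as reported by t.test().
tt <- t.test(small, medium, alternative = "two.sided", conf.level = 0.95)

print(c(manual = t_manual, t.test = unname(tt$statistic)))
print(c(manual = df_welch, t.test = unname(tt$parameter)))
```

Passing `var.equal = TRUE` would instead run the pooled-variance test with n1 + n2 - 2 degrees of freedom.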
Two Sample t-Test to test equality of the kilometers driven by ‘Dealer & Trustmark Dealer’ Seller-types
NULL HYPOTHESIS, H0 : Mean kilometers driven for dealers (seller_type = dealer) is equal to the mean kilometers driven for trustmark dealers. mean(dealer seller_type) = mean(trustmark dealer seller_type)
ALTERNATE HYPOTHESIS, H1 : Mean kilometers driven for dealers (seller_type = dealer) is not equal to the mean kilometers driven for trustmark dealers. mean(dealer seller_type) != mean(trustmark dealer seller_type)
ALTERNATIVE : Two.Sided
# Two-Sample t-Test to test equality of the kilometers driven by 'Dealer & Trustmark Dealer' Seller-types.
ttest <- t.test(car_sales$km_driven[car_sales$seller_type == "Dealer"], car_sales$km_driven[car_sales$seller_type == "Trustmark Dealer"], alternative = "two.sided", conf.level = .95)
ttest
##
## Welch Two Sample t-test
##
## data: car_sales$km_driven[car_sales$seller_type == "Dealer"] and car_sales$km_driven[car_sales$seller_type == "Trustmark Dealer"]
## t = 1.1005, df = 377.67, p-value = 0.2718
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.767033 6.258865
## sample estimates:
## mean of x mean of y
## 44.50460 42.25869
format(ttest$p.value, scientific = FALSE)
## [1] "0.2718353"
DEGREES OF FREEDOM : 377.6718623
P-VALUE : 0.2718353
Observation - Since the p-value of our Two-Sample t-Test is 0.2718353, which is not less than our alpha = 0.05, we cannot reject the Null Hypothesis of the test. This means that we do not have sufficient evidence to say that the mean kilometers driven for dealers (seller_type = dealer) differs from the mean kilometers driven for trustmark dealers in the data set.
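The same decision can be read from the confidence interval: for a two-sided test at alpha = 0.05, the null hypothesis is retained exactly when the 95 percent interval for the difference in means contains 0 (here, -1.767033 to 6.258865). A small sketch of that check follows, using synthetic stand-in vectors (an assumption for illustration) rather than the actual seller-type data.

```r
# Synthetic stand-in groups (NOT the CarDekho data).
set.seed(7)
dealer    <- rnorm(150, mean = 44, sd = 30)
trustmark <- rnorm(100, mean = 44, sd = 30)

alpha <- 0.05
tt <- t.test(dealer, trustmark, alternative = "two.sided", conf.level = 1 - alpha)

# Confidence interval for the difference in means (dealer minus trustmark).
ci <- tt$conf.int

# Decision from the interval: does it contain 0?
zero_in_ci <- ci[1] <= 0 && 0 <= ci[2]

# Decision from the p-value.
fail_to_reject <- tt$p.value >= alpha

print(ci)
print(c(zero_in_ci = zero_in_ci, fail_to_reject = fail_to_reject))
```

Both checks agree by construction, so reporting the interval alongside the p-value also shows how large a difference the data could plausibly support.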
The data set of car sales by CarDekho.com has provided various insights about the types of cars sold in the car industry and the patterns between them. The data set contains 8128 data points along with 13 features related to car details, engine details, and sale details.
In this assignment, we used one-sample t-tests to compare the variables ‘km_driven’ and ‘selling_price’ against the overall sample average, and two-sample t-tests to compare these variables across two groups. Based on the exploratory analysis performed earlier, we wanted to validate the following hypotheses using t-tests:
In the first five tests, the t-statistic and p-value gave us enough evidence to reject the NULL hypothesis; in the last test, however, we did not obtain enough evidence to reject it.
Vehicle dataset. (2020, October 24). Kaggle. https://www.kaggle.com/nehalbirla/vehicle-dataset-from-cardekho
Xie, Y., Dervieux, C., & Riederer, E. (2021, October 7). R Markdown Cookbook. R Markdown. Retrieved October 30, 2021, from https://bookdown.org/yihui/rmarkdown-cookbook/
Bluman, A. (2017). Elementary Statistics: A Step By Step Approach (10th ed.). McGraw-Hill Education.
Kabacoff, R. I. (2022). R in Action, Third Edition. Manning.
CarDekho. (2021). About Us | CarDekho.com. https://www.cardekho.com/info/about_us
F. (2021, April 2). tidyverse in r – Complete Tutorial. R-Bloggers. Retrieved October 30, 2021, from https://www.r-bloggers.com/2021/04/tidyverse-in-r-complete-tutorial/
D. (2021a, March 22). Data Analytics for Car Dealers. Automated Metrics. https://www.automatedmetrics.io/data-analytics-for-car-dealers/
The RMD file of the analysis is included with the analysis report.